Tight Area Bounds and Provably Good $AT^2$ Bounds for Sorting Circuits

Alan  Siegel 
Courant  Institute 

Abstract 

This paper gives tight upper and lower bounds for the minimum area required to sort $n$
$k$-bit numbers in a digital medium, when the inputs can be replicated up to $r \le n$ times. We
also give provably good $AT^2$ bounds for VLSI sorting circuits that read their inputs once.
Our lower bounds result from a coherent theory that captures the intrinsic complexity in both
$A$ and $AT^2$ measures for sorting circuits. Among other results, we prove:

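(1) The minimum area $A(n,k,r)$ required to sort $n$ $k$-bit numbers, which can be read $r$
times, satisfies

\[ A(n,k,r) = \Theta\begin{cases} \log n + 2^k \log\frac{n}{r2^k}, & 2^k \le n/r, \\ \log n + \frac{n}{r}\log\frac{r2^k}{n}, & n/r \le 2^k \le n^{O(1)}. \end{cases} \]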


(2) A VLSI circuit that sorts $n$ $k$-bit numbers, for $\log n \le k < 2\log n$, and that reads its inputs
exactly once satisfies

\[ AT^2 = \Omega\Bigl(n^2 \log^2 \frac{2^k}{n}\Bigr). \]


1. Introduction

"Minimum Storage Sorting Networks" [S1] analyzes the problem of how to sort $n$ $k$-bit
numbers in a minimum area VLSI circuit. Lower bounds for the circuit area are established,
and sorting circuits are described, which require an area that is within a constant factor of
optimal. These results are based on the fact that the inputs are read only once. We now
analyze the problem when the inputs are read $r \le n$ times. This problem is clearly more difficult,
and its solution requires new techniques to measure how much information a sorting circuit
must be able to retain.

In [S1], bounds are established by showing that a circuit, at times, has to store a certain
amount of information, lest it fail to sort correctly. Also, a Testing Lemma is used to verify
the presence of stored information. It is based on "testing", the ability to assign a variety of
different values to bits which have not yet been input to the circuit by a specific time. The
advantage in testing is that it provides a simple alternative to the sometimes arduous task of
constructing fooling sets, as formalized in [U]. Unfortunately, these simple techniques seem
to be inapplicable when inputs are read more than once. Therefore proofs in this paper are
based on counting arguments similar to fooling set constructions.

This choice of proof style has advantages. The techniques are more general, and, we
believe, more insightful. For example, Theorem 2 not only provides the foundation for all
our minimum area results, but more importantly, it establishes the central proof (and proof
technique) for our $AT^2$ bounds.

In our VLSI model, one unit of area is sufficient to store one bit. I/O ports
are counted as storage devices, and also require a unit of area. Wires have unit width and
take unit time to transmit a bit, independent of length. The I/O schedule is assumed to be
when- and where-determinate, i.e., the time and location for each input and output bit are
prespecified. Our area bounds are proved for multilective input; that is, data can be read
more than once. All computations and temporary storage of data must be done within the
sorting device. We note that although our model is typical of VLSI characterizations (and
our language frequently refers to such a model), the storage bounds are valid for a bit model
sorting network, and, in fact, any digital sequential sorter.

For an essentially planar network model, such as VLSI, minimum area and minimum
storage for sorting turn out to be within a constant factor of one another; for this reason, our
minimum storage bounds are always denoted by $A$. In the $AT^2$ complexity measure, however,
$A$ denotes circuit area (not storage) and $T$ denotes the circuit's running time; in this
case, area can exceed storage by more than a constant factor. Also, the input schedule is
required to be semelective in this case, that is, data is read only once. More complete information
about these models can be found in [U].

The input variables are named $X_1, X_2, \ldots, X_n$. The sorted output variables are
$Y_1, \ldots, Y_n$, where $Y_1 \le Y_2 \le \cdots \le Y_n$. The bits of $Y_j$ are $Y_{j,l}$, where $Y_{j,1}$ is the most significant
bit (msb). The msb is sometimes called the 1st most significant bit, and $Y_{j,k}$ is called a
1st least significant bit (lsb). When $k > \log n$, we call the $\log n$ more significant bits of an X
or Y variable the more significant bits, and define the less significant bits to have significance
$1 + \log n$ or more. Sometimes the distinction between more and less significant bits is made
relative to $\log\frac{n}{r}$, rather than $\log n$.

Our basic proof strategy is to measure how much additional information is needed, in
specific cases where a subset of the Y-bits are to be output, and a subset of the X-bits are
known "for free."

A few definitions will help simplify the subsequent exposition. Let a circuit compute a
boolean vector function $f: X \to Y$. Then $I(f(X))$ is the logarithm of the size of $f$'s range:
$I(f) = \log|\{y \mid y = f(x)\}|$. If $P$ is a predicate, then $I(f \mid P)$ is the logarithm of the size of $f$'s
range when the domain of $f$ is restricted to $\{X \mid P(X)\}$. If $P = \{p_i\}$ is a partition of $X$, then
$I(f \mid p_i) = I(f \mid X \in p_i)$.
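As a small illustration of these definitions, the following sketch computes $I(f)$ and a restricted $I(f \mid P)$ by brute force for a toy sorting function. The names `info`, `sort_fn`, and `domain`, and the choice of $n = 3$, $k = 2$, are assumptions made only for this example.

```python
import math
from itertools import product

def info(f, domain, predicate=lambda x: True):
    """log2 of the number of distinct values f takes on the restricted domain."""
    image = {f(x) for x in domain if predicate(x)}
    return math.log2(len(image))

# Toy instance (illustrative sizes): sort n = 3 numbers of k = 2 bits each.
n, k = 3, 2
domain = list(product(range(2 ** k), repeat=n))
sort_fn = lambda xs: tuple(sorted(xs))

# I(f): logarithm of the number of distinct sorted outputs.
print(info(sort_fn, domain))
# I(f | P) with the predicate P: "the first input equals 0".
print(info(sort_fn, domain, lambda xs: xs[0] == 0))
```

Restricting the domain can only shrink the range, so the second value is never larger than the first.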

Given $n$ $k$-bit variables, $X = \{X_i\}_{i=1}^{n}$, and an $n$-tuple $\Pi = (x_1, x_2, \ldots, x_n)$ of $\kappa$-bit numbers,
where $\kappa \le k$, we say $\Pi$ is a higher (lower) order assignment to $X$ if, for $i = 1,2,\ldots,n$, the
number comprising the $\kappa$ more (less) significant bits of $X_i$ equals $x_i$. Given a multiset $\Sigma$ of $n$
$\kappa$-bit numbers, where $\kappa \le k$, we say $X$ has a higher (lower) order restriction to $\Sigma$, if the higher
(lower) order assignment to $X$ is an $n$-tuple comprising a permutation of $\Sigma$. If $S$ is a set of
$n$-tuples, $X$ is restricted to $S$ means the X's can be assigned any tuple in $S$. Given a subset $\chi$
of the bit variables in $\{X_i\}_{i=1}^{n}$, an assignment to $\chi$ is an assignment of values to $\chi$. If $R$ is an
assignment to $\chi$, and $\Sigma$ is a multiset of $n$ $l$-bit numbers, then $\Sigma|_R$ is the collection of X-assignments
($n$-tuples) comprised from permutations of $\Sigma$ that agree with $R$ on $\chi$. Finally,
the symbol $\circ$ is used to denote a kind of tuple concatenation: if $S = \{(a,b,c),(d,e,f)\}$, and
$\Pi = (x,y,z)$, then $S \circ \Pi = \{(ax,by,cz),(dx,ey,fz)\}$, where $ax$ means bit string $a$ concatenated
with bit string $x$.
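The tuple concatenation just defined can be sketched in a few lines. Bit strings are modelled here as Python strings, and the function name `compose` is an assumption made for illustration.

```python
def compose(S, Pi):
    """Tuple concatenation: append Pi's bit strings, position by position,
    to every tuple in the set S."""
    return {tuple(hi + lo for hi, lo in zip(t, Pi)) for t in S}

# Two higher order assignments and one lower order assignment (illustrative values).
S = {("00", "01", "10"), ("10", "00", "01")}
Pi = ("1", "0", "1")
print(sorted(compose(S, Pi)))
```

Each tuple of the result pairs one higher order assignment from `S` with the single lower order assignment `Pi`.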

2.  Lower  Bounds  for  Area 

To  simplify  the  exposition,  we  shall  assume,  in  this  section,  that  the  I/O  schedule  inputs 
and  outputs  at  most  one  bit  per  time  step.  Since  I/O  pads  consume  unit  area,  concurrent 
inputs  can  be  viewed  as  serial  inputs  into  a  shift  register.  Our  area  (only)  bounds  never 
include  area  consumed  by  wires,  and  consequently  there  is  no  cost  attributed  to  routing  the 
stored  inputs  to  their  destinations.  (Furthermore,  such  a  routing  would  only  affect  the  area 
by a constant factor.) Thus the area bounds for the simplified schedules also apply to circuits
using arbitrary determinate schedules. Similarly, we may assume that input and output
events  never  occur  at  the  same  time  step.  It  should  be  noted  that  the  input  schedule  is  not 
assumed  to  be,  say,  r  copies  of  a  schedule  where  each  input  occurs  once.  Each  bit  of  an 
input  variable  can  have  its  r  input  instances  read  in  any  determinate  way,  and  could,  of 
course,  have  fewer  than  r  instances. 

Theorem  1. 

A when- and where-determinate sorter of $n$ $k$-bit numbers that inputs its data $r$
times, where $2^k < \frac{n}{r}$, satisfies $A = \Omega(2^k \log\frac{n}{r2^k})$.

Proof: It suffices to assume that $2^k \le n/2r$. Choose a time interval $\tau = [T_1, T_2]$ so that during
the time $t \in \tau$, $n/2r$ lsb's are output by the circuit, and no more than $n/2$ least significant bits
are input. Without loss of generality, we may assume that $n/4r$ of the bits output during $\tau$
belong to variables $Y_{i,k}$, where $i \le n/2$. (An analogous argument holds for $i > n/2$.) Let those
$n/4r$ lsb's comprise the set of variables $\Psi = \{Y_{i_1,k}, Y_{i_2,k}, \ldots, Y_{i_{n/4r},k}\}$. Name the inputs so that
the least significant bits read during $\tau$ are $X_{1,k}, \ldots, X_{\alpha,k}$, where $\alpha \le n/2$. By assigning
$X_i = 2^k - 1$ for $1 \le i \le \alpha$, we can guarantee that the outputs in $\Psi$ are independent of the lsb's
read during $\tau$. The circuit must therefore have a memory at least as large as the information
content of the outputs of these $n/4r$ least significant Y-bits. Formally, we assign values to the
remaining $\{X_j\}_{j=\alpha+1}^{n}$ in batches. Put $i_0 = 0$, and let $i_l - i_{l-1}$ of these X's equal $\xi_l$, for
$l = 1,2,\ldots,n/4r$. Set the remaining $n - \alpha - i_{n/4r}$ X's to $2^k - 1$. Thus $Y_{i_l} = \xi_l$. Now the chip need
not "remember" $n/4r$ distinct assignments to the lsb variables in $\Psi$, because many of the
$\xi$ variables must have identical values (if $2^k \ll n/2r$). We complete the proof by assigning
values to the $\xi$'s in batches. Put $\eta = \frac{n}{4r2^{k-1}}$, and, for $j = 0,1,\ldots,2^{k-1}-1$, let
$\xi_{j\eta}, \xi_{j\eta+1}, \ldots, \xi_{j\eta+\eta-1}$ have the number $j$ comprising its $k-1$ most significant bits. This $j$th
batch has 0 assigned as the least significant bit for the first $l_j$ $\xi$'s, and 1 assigned for the $\eta - l_j$
remaining $\xi$'s. Thus for each $j$, $l_j$ is the number of output variables in $\Psi$, which have $j$
assigned to the $k-1$ more significant bits, and which have 0 assigned to the least significant
bit. Since $l_j$ can be any value from 0 to $\eta$, the number of such distinct outputs is at least

\[ \Bigl(1 + \frac{n}{4r2^{k-1}}\Bigr)^{2^{k-1}}, \]

whence taking the logarithm gives $\Omega(2^k \log\frac{n}{r2^k})$. □
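The counting step at the end of this proof can be checked by brute force on a tiny instance. The sketch below assumes the reading that there are $2^{k-1}$ batches of equal size (called `eta` here) and that each batch contributes a run of 0 lsb's followed by a run of 1 lsb's; the function name and the sample sizes are illustrative only.

```python
import math
from itertools import product

def count_lsb_outputs(k, eta):
    """Count distinct lsb output vectors when each of the 2**(k-1) batches
    independently chooses how many of its eta members get lsb 0."""
    batches = 2 ** (k - 1)
    seen = set()
    for ls in product(range(eta + 1), repeat=batches):
        bits = []
        for l in ls:
            # The first l members of the batch get lsb 0, the rest get lsb 1.
            bits.extend([0] * l + [1] * (eta - l))
        seen.add(tuple(bits))
    return len(seen)

k, eta = 3, 2
count = count_lsb_outputs(k, eta)
# Brute-force count, the closed form (1 + eta)^(2^(k-1)), and its logarithm.
print(count, (1 + eta) ** (2 ** (k - 1)), math.log2(count))
```

The logarithm of the count is the number of bits the circuit must be able to hold, which is where the $2^{k-1}\log(1+\eta)$ term comes from.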

Theorem  2. 

A when- and where-determinate sorter of $n$ $k$-bit numbers that inputs its data $r$
times, where $\frac{n}{r} < 2^k < (\frac{n}{r})^2$, satisfies $A = \Omega(\frac{n}{r}\log\frac{r2^k}{n})$.

Proof: Write $\eta = n/r$, $\kappa = \log\eta$, and $k = \kappa + p$. We now refer to the $\kappa$ most significant bits of
each input number as the more significant; the $p$ least significant bits are less significant. In
terms of the new variables, the area bound can be written as $A = \Omega(\eta p)$. It evidently suffices
to have $p < \epsilon\kappa$, where $\epsilon$ is positive and will be specified later. In view of Theorem 1, we may
also assume that $p > 6$.

We want to find a time interval during which a set $\Psi$, that comprises a large number of
less significant bit variables, is output, but during which at most half of the more significant
bits are read. An area bound can then be found by computing how much additional information
must be remembered by the circuit, if its outputs are to be correct. Since there will be no
restrictions on which less significant bits are read during the interval, we bound the uncertainty
in the output, essentially, by the somewhat informal expression:

$I(\Psi \mid$ all X-inputs except for $\frac{n\kappa}{2}$ more significant X-bits have fixed known assignments$)$.

The principal difficulties in estimating this expression are that the known information
can be asymptotically much larger than $|\Psi|$, and that some "large" sets of less significant bit
variables are so highly correlated that their information content is too small to give an adequate
estimate. Consequently, simple counting estimates seem to be insufficient for establishing
the result, and a formal fooling set construction seems to be needed. Furthermore,
some notion of the correlation among bit variables must presumably be included in the argument.

The  construction  has  five  steps: 

(1) Find a set $\Psi$ of $\eta\frac{p}{6}$ less significant output bits that have a total correlation cost $< 1/4$, and
that are output in a time interval $\tau$ when at most $\frac{n\kappa}{2}$ more significant bits are input.
(The correlation cost measure for an $s$-th significance output bit turns out to be $2^{-s}$.)

(2) Assign a skewed distribution of values to the Y outputs so that the ordering of the Y's is
independent of the assignments to the bits in $\Psi$. (Thus if the most significant bit variable
that belongs to both $Y_i$ and $\Psi$ has significance $s$, then make sure (essentially) that
$Y_{i+1} \ge Y_i + 2^{k-s+1}$. The $s$ more significant bits of $Y_i$ are set accordingly, for $i = 1,2,\ldots,n$.)

(3) Choose $n/4$ input variables that have the fewest number of more significant bits read
during $\tau$.

(4) Partition the selected input variables into sets having almost identical bit sets read during
$\tau$, and assign to large subpartitions output values that appear identical on the bit
sets; then make the various lower significance bit assignments as uniformly distributed
as possible.

(5) Show that because of the more significant X-bits which are not input during $\tau$ and the
less significant bit assignments to the X's, the uncertainty of the less significant bit
values for the Y's within each subpartition is high.

Let $T_h$, $h = 0,1,\ldots,6r$, be a partition of time so that during each of the time intervals
$t \in (T_h, T_{h+1}]$, $\frac{\eta p}{6}$ bits, of significance $\kappa+1$ to $k$, are output. Let

\[ C_h = \sum_{Y_{j,l}\ \text{output during}\ (T_h, T_{h+1}]} 2^{-l}. \]

$C_h$ is, essentially, the cost of having those lesser significant bits be independent.


Step 1. Clearly

\[ \sum_{h=0}^{6r-1} C_h = \sum_{1 \le j \le n}\ \sum_{\kappa < l \le k} 2^{-l} < \frac{n}{\eta} = r, \]

so at least $2r$ of the $C_h$'s have values less than 1/4. Furthermore, during some time interval $\tau = (T_h, T_{h+1}]$ corresponding to one of
those $C_h$'s, no more than $\frac{n\kappa}{2}$ bits can be read, which are of 1 to $\kappa$ significance, since the
more significant bits are read a total of $rn\kappa$ times. Let that value of $C_h$ be $C$. Let $S_j$ be the
set of significance positions of the less significant bits belonging to $Y_j$, which are output during
$\tau$, for $j = 1,\ldots,n$, that is, the bit variables in $\{Y_{j,b}\}_{b \in S_j}$ comprise the less significant bits
output during $\tau$. This set comprises the less significant bit variables $\Psi$ of step 1.
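The averaging argument of Step 1 can be sanity-checked numerically. The sketch below assumes the reading that a bit of significance $l$ costs $2^{-l}$, that the less significant bits occupy positions $\kappa+1,\ldots,k$, and that the $6r$ intervals each output the same number of such bits; the output order is shuffled arbitrarily, and all names and sizes are illustrative.

```python
import random

def interval_costs(n, r, kappa, k, seed=0):
    """Split the less significant output bits into 6r equal intervals
    (in an arbitrary order) and return the cost of each interval."""
    bits = [(j, l) for j in range(n) for l in range(kappa + 1, k + 1)]
    random.Random(seed).shuffle(bits)  # arbitrary output schedule (assumption)
    per_interval = len(bits) // (6 * r)
    return [
        sum(2.0 ** -l for _, l in bits[h * per_interval:(h + 1) * per_interval])
        for h in range(6 * r)
    ]

# Illustrative sizes: eta = n / r = 16, kappa = log2(eta) = 4, p = k - kappa = 3.
n, r, kappa, k = 64, 4, 4, 7
costs = interval_costs(n, r, kappa, k)
cheap = sum(c < 0.25 for c in costs)
print(sum(costs), cheap)  # total cost stays below r; at least 2r intervals are cheap
```

Because the total cost is below $r$, fewer than $4r$ intervals can cost 1/4 or more, whatever the output order, which leaves at least $2r$ cheap intervals out of $6r$.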

Step 2. We now assign values to the more significant bit positions of the Y outputs. Let
$s_j = k + 1 - \min\{\infty, b \in S_j\}$. We temporarily set $Y_1 = 0$, and provisionally put, for $j = 2,3,\ldots,n$,

\[ Y_j = \begin{cases} Y_{j-1} + 2^{s_{j-1}}, & \text{if } s_{j-1} \ge s_j, \\ Y_{j-1} + 2^{s_j} - [Y_{j-1} \bmod 2^{s_j}], & \text{if } s_{j-1} < s_j. \end{cases} \]

Evidently $Y_n \le 2^{k+2}C < 2^k$, so the Y's take on permissible values. Note also that $Y_j = Y_{j-1}$
only when $s_j = s_{j-1} = -\infty$. Furthermore, the $s_j$ least significant bits of $Y_j$ can be altered arbitrarily
without affecting the ordering of the Y's. Let the set $V$ consist of all the $Y_j$'s with
$s_j > 0$. Any $Y_j \in V$ currently has its $s_j$ least significant bits set to 0. These lower significant bit
assignments will be altered in Step 4. The assignments to bits of greater significance are permanent.
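The skewed assignment of Step 2 can be sketched as follows. This is an interpretation of the recurrence rather than the author's exact procedure: `s[j]` is taken to be the number of low-order bits of $Y_j$ that must remain free, with 0 standing for a $Y_j$ that outputs no less significant bit during the interval. The function names and the sample list are assumptions for illustration.

```python
from itertools import product

def skewed_assignment(s):
    """Assign base values so that the s[j] low bits of each value can be
    changed freely without changing the sorted order."""
    ys = [0]
    for j in range(1, len(s)):
        prev = ys[-1]
        if s[j - 1] >= s[j]:
            # Step past the whole range the previous value can still reach.
            gap = (1 << s[j - 1]) if s[j - 1] > 0 else 0
            ys.append(prev + gap)
        else:
            # Round up strictly to the next multiple of 2**s[j].
            block = 1 << s[j]
            ys.append(prev + block - prev % block)
    return ys

def order_is_fixed(s):
    """Exhaustively check that no choice of free low bits reorders the values."""
    ys = skewed_assignment(s)
    for low in product(*(range(1 << sj) for sj in s)):
        filled = [y + b for y, b in zip(ys, low)]
        if filled != sorted(filled):
            return False
    return True

s = [2, 1, 0, 3, 3]
print(skewed_assignment(s), order_is_fixed(s))
```

The exhaustive check is only feasible for tiny inputs, but it makes the purpose of the two cases concrete: each value leaves room for its predecessor's free bits and is aligned for its own.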

Step 3 begins with the observation that there are at least $\frac{n}{4}$ input variables, each having
at least $\frac{\kappa}{3}$ more significant bits which are not input during $\tau$. Now the number of different
ways $\frac{\kappa}{3}$ bits can be distributed among the $\kappa$ more significant positions is only

\[ \binom{\kappa}{\kappa/3} < \eta^{\log 3 - 2/3} < \eta^{.92}. \]

If the X's are partitioned into "pools", where pool members have
an identical set of $\frac{\kappa}{3}$ unread more significant bit positions, then the mean pool size is at least
$\frac{1}{4}n^{.08}r^{.92}$. (If a variable has more than $\frac{\kappa}{3}$ more significant bits that are unread during $\tau$, we
can arbitrarily ignore all but a subset of size $\frac{\kappa}{3}$.) If we consider only X's belonging to pools
of size $\frac{1}{8}n^{.08}r^{.92}$ or more, then at least $\frac{n}{8}$ X's remain.
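The count of unread-position patterns in Step 3 can be checked numerically. The sketch assumes the reading that the bound is $\binom{\kappa}{\kappa/3} < \eta^{.92}$ with $\eta = 2^\kappa$; the function name and the tested range of $\kappa$ are illustrative.

```python
import math

def pool_pattern_bound_holds(kappa):
    """Check log2 C(kappa, kappa/3) < 0.92 * kappa, i.e. C(kappa, kappa/3) < eta**0.92
    for eta = 2**kappa. Logarithms avoid overflow for large kappa."""
    patterns = math.comb(kappa, kappa // 3)
    return math.log2(patterns) < 0.92 * kappa

# kappa ranges over multiples of 3 so that kappa / 3 is an integer.
print(all(pool_pattern_bound_holds(kappa) for kappa in range(3, 601, 3)))
```

The exponent .92 sits just above the binary entropy of 1/3, which is about .918 and equals $\log 3 - 2/3$ with base-2 logarithms.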
8  8 

Step 4. Values can now be assigned to the X's. We avoid matching difficulties by supposing
that there are in fact more than $n$ "good" ones, rather than just $\frac{n}{8}$. Accordingly, each
pool is replicated, say, 16 times. Later amends will trivially rectify this overcount.

Our strategy is to assign values to almost all the X's in a pool before going on to use the
members in another pool. Accordingly, we pick a pool of X's, select a $Y_j \in V$, and put $s = s_j$.
Let (the 0's in the binary number mask) $B$ denote a set of $\frac{\kappa}{3}$ more significant bit positions,
which are not input during $\tau$, for members of the X-pool. Let $\mu$ be the number of $Y_h \in V$,
which have the same bit values as $Y_j$ on the unmasked bits in $B$, and for which $s_h = s_j$:
$\mu = |\{Y_h \in V \mid Y_h \wedge B = Y_j \wedge B,\ s_h = s\}|$. If $\mu < 2^s$ the process is repeated with another Y selection.
Otherwise, the matching Y's are removed from $V$ in groups of $2^s$, and corresponding X's are
removed from the pool. Each X is assigned the $k-s$ leading bits of its corresponding Y. The
less significant bits of a $2^s$-member X-group are assigned some permutation of the respective
$s$-bit values $0,1,\ldots,2^s-1$. If the number of matching Y's is not a multiple of $2^s$ it suffices to
leave that last remnant Y group in $V$.

When an X pool has fewer than $2^s$ remaining members, it is discarded and a new pool
(of size $\ge 2^p$) is started along with new selections of Y's from $V$. Since we have assumed
that the number of X pools is sufficient, the process will only terminate when no $Y \in V$, for
the selected (mask defining the) X pool, gives a sufficient count for $\mu$. In that case, the total
number of Y's remaining in $V$ is small. For each value $s$, the X pool (or more precisely the
mask)  partitions  the  set  {K^f  V  :  s,,=-s)  into  2*"'"'^^  equivalence  classes.  Evidently  each  class 
contains  fewer  than  V  Y'%.  Thus  the  number  of  less  significant  output  bits,  which  are  output 
during  t,  and  which  belong  to  the  Y'%  left  in  V,  is  bounded  by 


_  2,'3 

y3 


If  we  require  p2P<-4r-n^'^,  then  at  least  t]-^  output  bits  will  be  assigned  values  by  this  pro- 

cess.    The  other  constraint  on  p  is  that  the  pool  size  be  large  enough  to  have  2^  members: 
2^s2Ps-i-n08r-'2    jy^^^^  j^  is  sufficient  to  set  e=.07,  so  that  iL<2*<(iL)i.07  ^^^  ^^^^  ^>g 

or  n  large). 

The accounting for the excess pools of X's is now simple. For each of the 16 multiples
of a pool, keep the one whose corresponding Y set has the maximum number of less significant
bits that are output during $\tau$. Let the set $V$ comprise those Y collections. Thus, the Y's
in $V$ have at least $\eta p/192$ less significant output bits which can have various values, depending
on the unread X bits, and which are output during $\tau$. All remaining Y's are paired with
unmatched (genuine) X's, which are assigned the value of their corresponding Y's.

Step 5. By construction, $V = \bigcup_\lambda Z_\lambda$, where the union is over disjoint sets, and each $Z$ contains
(for some $s$) exactly $2^s$ of the Y's, each of which has $s_j = s$. Moreover, those Y's can
have their $s$ least significant bits assigned any permutation of the $s$-bit numbers $0,1,\ldots,2^s-1$,
depending on unread bit values of the X's. Let $b_\lambda$ be the number of less significant bits that
belong to the Y's in $Z_\lambda$, and that are output during $\tau$: $b_\lambda = \sum_{Y_j \in Z_\lambda} |S_j|$. Let $n_\lambda$ be the number of
different possible output configurations (assignments) for the $b_\lambda$ bits. Then


>  2f£:!:  =  ^-2'2^>2^-^-^^^'. 


2^2'-\        2'^'"*'' 


Therefore,  if  each  YiZ^  has,  on  the  average,  at  least  2  output  bits,  so  that  b^  a  2^*\  then 
b^^.25b^+1.5x2\  whence  n^>2  \  K  d^<2^'^^  then  an  adequate  estimate  for  n^  is  based 
on  the  fact  that  exactly  2^"^  of  the  s'th  least  significant  bits  will  be  I's: 


P' J  s  22'-'>2^'''\      b,<T-K 


It  follows  that  due  to  the  uncertainty  of  the  more  significant  X-bit  values  that  are  not  read 
during  t,  the  total  number  of  different  assignments  to  the  less  significant  bits  that  are  output 

during  t  is  at  least  H^x  ^2'^,  and  hence  A  =  n(— ]og-^^= — ).  n 

X    ■  r  n 

It is worth summarizing what we have proved. Let $W_\kappa$ be the set of masks that
correspond to the $n \times k$ bits of the input variables, and that have $\frac{n\kappa}{2}$ 1's among the $n\kappa$ more
significant bits, and all 1's among the $(nk - n\kappa)$ less significant bits (i.e., all less significant
bits are revealed). Let $Q_{\kappa,r}$ be the set of all partitions of these less significant output bits that
comprise $r$ subsets. Given a partition $q \in Q$ (with subscripts $\kappa$ and $r$ understood), we
represent a set $q_i \in q$ by its mask. We have proved, in Theorem 2, that for a sorting function
$f$ (where $n/r < 2^k < (n/r)^2$),

\[ \exists\kappa\ \forall q \in Q\ \exists q_i \in q\ \forall w \in W\ \exists z:\quad I(f(X) \wedge q_i \mid X \wedge w = z) = \Omega\Bigl(\frac{n}{r}\log\frac{r2^k}{n}\Bigr). \]

Furthermore, this bound for $f$ is achieved when $\kappa = \log\frac{n}{r}$. A more precise statement would
include the fact that the assignments for the more significant X-bits can be restricted to permutations
of a multiset that is existentially quantified at the innermost level, and that is known
(i.e., has its values given as part of the conditional expression for $I$).


It should be noted that the proof is somewhat simpler when $r$ is set to 1, since each X
can then receive a unique value for its $\log n$ more significant bits. In this case, the less significant
bits are completely uncorrelated; they can have arbitrary assignments without changing
the permutation. The consequence of all this is that the variable $s$ in the proof of Theorem 2
can be fixed at $s = k + 1 - \log n$. Let $k \in [1 + \log n, 1.07\log n]$, and let $W$ be the set of masks that
correspond to the $n \times k$ input bits, and that have $\frac{n\log n}{2}$ 1's among the $n\log n$ more significant
bits, and all 1's among the less significant bits. Let $Q$ be the set of all subsets containing half
of the less significant output variables. We again represent $q \in Q$ as a mask, and take $|q|$ to be
the number of ones in the mask (i.e., the number of bits revealed), so that $|q| = n(k - \log n)/2$.
From Theorem 2, we have that for a sorting function $f$,

\[ \forall q \in Q,\ w \in W\ \exists z:\quad I[f(X) \wedge q \mid X \wedge w = z] = \Omega(|q|). \]

It is reasonable to ask if the above bound would hold if $Q$ were to contain all subsets
of less significant output bits. The answer turns out to be affirmative, although we will
show, in Corollary 1, a somewhat weaker result, which is adequate for our purposes. In [S2],
the following stronger result is established:
For $k \in [1 + \log n, 2\log n]$, and $Q$ the power set of the less significant output bits,

\[ \forall q \in Q,\ w \in W\ \exists z:\quad I[f(X) \wedge q \mid X \wedge w = z] \ge |q|(.5 - o(1)). \]

Furthermore, the constant .5 is the best possible (although the function $.5|q|$ is only tight up
to a factor of 2 for arbitrary $|q|$).

The weaker result will follow from our construction for the case $k = 1.07\log n$ (and
$r = 1$). Given a set $\chi$ of $\frac{n\log n}{2}$ more significant input bits, the construction, in this case, can be
used to find an assignment $R$ on $\chi$, the multiset $U = \{0,1,2,\ldots,n-1\}$, a lower order assignment
$L$, and a rectangular set of less significant output bit variables $\Psi = \{Y_{j,l} \mid j \in J,\ l \in [1 + \log n, k]\}$,
where $J \subseteq [1,n]$ and $|J| = n/32$, such that

\[ I(\Psi \mid (X_1, \ldots, X_n) \in U|_R \circ L) \ge .25|\Psi|. \]

Moreover, the uncertainty in the values of $\Psi$ will be a uniform .25 per bit: if $\xi \subseteq \Psi$, then

\[ I(\xi \mid (X_1, \ldots, X_n) \in U|_R \circ L) \ge .25|\xi|. \]

We  deduce 
Corollary  1. 

Suppose $n < 2^k \le n^2$. Let $\chi$ be a set of $\frac{n\log n}{2}$ input bit variables, each belonging to the
$\log n$ more significant positions, and let $\Psi$ be a set of output bit variables, which is contained
in $\{Y_{i,l} \mid n/3 \le i < 2n/3,\ \log n < l\}$. Then there is a multiset, $U$, of $n$ $(\log n)$-bit numbers, a lower
order assignment, $L$, of $n$ $(k - \log n)$-bit numbers to $X$, and an assignment, $R$, of values to $\chi$
so that

\[ I(\Psi \mid (X_1, \ldots, X_n) \in U|_R \circ L) = \Omega(|\Psi|). \]

Proof: Note that the output bits $\Psi$ belong to numbers ranking in the middle third of the $n$
values. Furthermore, they are of sufficiently low significance that they need not affect the
ordering of the Y's.

Put  v=n/3,  and  pick  X€[0,2v].   We  consider  the  problem  of  sorting  Xi,X2 X^  into 

}'i+j^,y2+x.  •  •  •  .l'v+,\-  Suppose,  for  the  moment,  that  Jk=1.071ogv.  A  fooling  set  construc- 
tion is  as  follows.    We  rename  the  X's,  if  necessary,  so  that  X^jXj,  .  .  .  ,X^  have  at  most 

"^°/^   bits  in  X-    We  set,  for  j>v,    2v-\  of  the  X,'s  to  2*-l,  and  X  of  them  to  0.    As 

described  in  the  remarks  following  Theorem  2,  we  construct  a  Cartesian  rectangle  ^  of 

- — rr-^  less  significant  output  bits,  a  multiset  t/={0,l,2,...,v-l},  an  assigimient  R  on  x. 

and  a  lower  order  assignment  L,  such  that 

/(  ^\iX, X,)  €  U\^oL  )  a  .251^1. 

The  point  of  this  construction  is  that  the  choice  of  lambda  merely  shifts  ^  with  respect 

to  the  Y's.  Specifically,  for  X  =  0,  ^={yy.i+iogv.iy.2+iogv l^.i-OTiop-btA.  for  some  set 

AC{l,2,...,v},  |A|=^.   For  general  X,  ^(X)={yy.i+i,g„y,.2+,„g,.  .  .  .  .l^.i+ioA-x.^.v   By 

averaging  over  X,  it  is  easy  to  show  that  for  some  shift  Xq,  |^(Xo)n^|a:|>I'|/64.  The  set  ^  in 
this  construction  is  easily  modified  for  the  general  case  n<2*<n^,  with  the  result  that 
|^(Xo)n^I'|=n(|^|),  whence  the  result  follows,  a 

Curiously, Corollary 1 appears to be too weak to prove a good area bound in the case
$(n/r)^2 \le 2^k \le n$, which is the subject of Theorem 3. The obvious application is to combine it

with  the  fact  that  there  must  be  a  suitable  interval  of  time  during  which  at  most  "  °^"  more 

significant  bits  are  input,  and  during  which  n —    ^  suitable  less  significant  bits,  ^,  are  out- 

6r 

put.   The  conclusion  would  be  that  A  =  ft(— log ),  which  is  a  bit  weak  (the  best  bound  is 

n(— log ) ),  and  does  not  hold  for  the  desired  range  of  ik. 

r  n 

Theorem  3. 

A when- and where-determinate sorter of $n$ $k$-bit numbers that inputs its data $r \le n$
times, where $(\frac{n}{r})^2 < 2^k \le n$, satisfies $A = \Omega(\frac{nk}{r})$.


u 


2k 
Proof:   Put  tt  =  "Y''  ^  =  2".  We  will  apply  Corollary  1  to  a  sorting  problem  having  r\  Y's  and 

X'i. 

kn 

Standard  arguments  show  that  there  is  an  interval  of  time  t,  when  no  more  than  —  bits 

are  read,  which  belong  to  the  k  more  significant  positions,  and  when  — r-  bits  are  output, 

18r 

which  belong  to  the  outputs  Y^,  where  n/3sj<2n/3,  and  />k.   Let  the  y's  having  these  less 

kn       k2' 
significant  output  bits  comprise  {1'x}\€A-   Clearly  |A|s— — ^— — .   Furthermore,  it  is  easy  to 

18r       18 

Check       that       ■^<Y-        So       |A|<^.       Let       Ai={l,2,...,L(ii-|A|)/2j},       and 

A2={y-y-+1 Y""^  r('n-|A|)/2l-l}.    Put  A=AiUAUA:.    It  suffices  to  assume  that 

n>3,  whence  -^>-^>-^^^— r^ — '-.  This  inequality  ensures  that  the  indices  in  A  are  between  the 
indices  of  A^  and  A2.  Qearly  |A|='ri.  We  rename  the  X's,  if  necessary,  so  that 
Xi,X2 X^  have  at  most  -^  bits  which  are  input  during  t,  and  which  belong  to  the  k 

more  significant  positions.  This  is  possible  since  on  the  average,  each  X  has  only  -^  more 
significant  X-bits  read  during  t.  We  apply  Corollary  1  to  this  reduced  problem,  and  conclude 
that  A  =  n(— )  for  sorting  {X,}JLi  into  {i'x},v«x- 

To  complete  the  proof,  we  must  ensure  that  the  larger  problem  contains  our  specific  t\- 
number  sorting  problem.  This  is  easy.  It  suffices  to  give  the  remaining  inputs,  {X,}".^^^  fixed 
values,  assigned  in  batches.  Let  A={X,}p.|,  where  Xi<X2<  •  •  ■  "^^t)-  ^^  ^o~0.  Let  n  —  k^ 
X's  be  2*-l,  and,  for  i  =  1,2,...,ti,  set  X,-X,_i-1  X's  equal  to  ^,.  Define  the  ^'s  so  that 
5(^=0  for  />K,  and  ?/_;=l'x  ;  ^°^  '^•^-  ^^  •*  ^**y  ^o  ^^^  ^^*  these  X's  have  the  right  rankings; 
they  fill  the  Y  gaps  correctly.  □ 

It should be observed that in Theorem 3, we construct a permutation on a thin set
($|\Psi| \ll n$). The construction given in Theorem 2 cannot be directly applied in this case
(unless, for technical reasons, it turns out that $|\Psi|$ is extremely small) because almost all of
the output bits may be mostly "wasted." Theorem 2 succeeded by constructing a permutation
only among Y's having output bits in $\Psi$; no Y's without such timely less significant output
bits were used in the permutation construction. The difficulty with using these "empty
Y's" is to make certain that they are not overcommitted, that is, to ensure their less significant
assignments are not needed by members of $\Psi$ assigned to many different X "pools".

Theorem 2 and Theorem 3 can be combined to read,

Corollary  2. 

A when- and where-determinate sorter of $n$ $k$-bit numbers that inputs its data $r \le n$
times, where $\frac{n}{r} < 2^k < n^{O(1)}$, satisfies $A = \Omega(\frac{n}{r}\log\frac{r2^k}{n})$. □

The  following  lemma  improves  our  lower  bound  when  r  is  large. 

Lemma  1. 

For any number of replicated inputs, the area of a circuit that sorts $n$ $k$-bit numbers
must satisfy $A = \Omega(\log n)$.

Proof: The reason for this bound is that a circuit that inputs $rnk$ bits and outputs $nk$ must
have enough internal states to be able to count, effectively, to $\min(r,n)$. Adding $\log r$, for
$r \le n$, to the bounds in Theorem 1 or Corollary 2 is equivalent to adding $\Omega(\log n)$. □

3. $AT^2$ Bounds

The proof of Theorem 2, in the case r = 1, shows

Theorem 4.

Suppose $n \le 2^k \le n^{O(1)}$. Then a when- and where-determinate circuit that sorts n k-bit
numbers and reads the inputs once must satisfy $AT^2 = \Omega\left(n^2\log^2\frac{2^{k+1}}{n}\right)$.

Proof: We may assume $k \ge 1 + \log n$, as otherwise we can replace n with n/2. We can also
assume $T < \frac{n}{3}\log\frac{2^k}{n}$, as otherwise the result is trivially true. The chip is partitioned in the
standard manner so that each side of the circuit outputs at least $\frac{n}{3}\log\frac{2^k}{n}$ bits, which are of
significance $1 + \log n$ or below. At least half of the more significant bits (i.e., of $\log n$ significance
or better) must be input on one side of the circuit, say, the left. Then the proof of
Theorem 2 shows that the right side needs $\Omega\left(n\log\frac{2^k}{n}\right)$ bits of information from the left, if the
circuit is to sort correctly. □
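The passage from the information bound to the $AT^2$ bound is the usual cut argument. The following is only a sketch, under two assumptions that the proof leaves implicit: the partition is realized by a cut that crosses $O(\sqrt{A})$ wires, and each wire carries at most one bit per time unit. The symbol $I$ is introduced here for the number of bits that must cross the cut.

```latex
\begin{align*}
I &= \Omega\!\left(n\log\frac{2^k}{n}\right)
   && \text{(bits the right side needs from the left)}\\
I &\le O(\sqrt{A})\cdot T
   && \text{(wires crossing the cut, times the running time)}\\
AT^2 &= \Omega(I^2) = \Omega\!\left(n^2\log^2\frac{2^k}{n}\right).
\end{align*}
```

Since $k \ge 1 + \log n$ is assumed, $\log\frac{2^{k+1}}{n} \le 2\log\frac{2^k}{n}$, so the same bound holds with $2^{k+1}$ in place of $2^k$.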

Note  that  the  input  distribution  of  less  significant  X-bits  has  not  been  specified,  insofar  as  the 
partitioning  line  (as  well  as  the  choice  of  sides)  is  concerned.  The  proof  thus  permits  the 
right  side  of  the  circuit  to  know  all  of  these  assignments!  Since  we  are  using  an  information 
theoretic  (i.e.  counting)  argument,  we  may  suppose  that  the  left  side  is  an  oracle,  and  knows 
the  values  of  all  the  input  bits.  The  information  flow  is  now  unidirectional;  we  are  estimating 
the  amount  of  essential  information  that  the  right  side  of  the  circuit  is  still  missing.  The 
point of this formulation is that these "benefits of the doubt" are of no help whatsoever (up
to a constant factor). The information that must cross the partition is proportional to the
information  content  of  the  less  significant  output  bits.  Moreover,  we  need  no  new  proof  to 
establish  the  fact. 

For completeness, we note that in [S1], a multiple cut argument is used to prove

Theorem 5 [S1].

Suppose $2^k \le n$. Then a when- and where-determinate circuit that sorts n k-bit numbers
and reads the inputs once must satisfy $AT^2 = \Omega(n2^k)$. □

When $k \le (1+o(1))\log n$, there are actually two $AT^2$ bounds. We have as yet made no
restrictions on how the input and output ports are distributed in the sorting circuit. For VLSI
models, the I/O ports are typically required to be located along the perimeter of the circuit
(or more precisely, the convex hull). We may assume, without loss of generality, that the circuit
is rectangular, with length L, and width $W \le L$. Ullman [U] points out that in cases restricted
to perimeter I/O, there is an LT bound based on the requirement to read the input
data: $LT = \Omega(nk)$. If the information that must flow across the circuit is I, then taking a
Thompson cut along the shorter direction gives $WT = \Omega(I)$, so that $AT^2 = \Omega(Ink)$. Our contribution,
in this case, is the determination of I. The arguments for Theorem 4 give
$I = \Omega\left(n\log\frac{2^{k+1}}{n}\right)$. Consequently, we have

Theorem 6.

Suppose $n \le 2^k \le n^{O(1)}$. Then a when- and where-determinate circuit that sorts n k-bit
numbers, reads its inputs once, and has its I/O ports along its perimeter must satisfy
$AT^2 = \Omega\left(kn^2\log\frac{2^{k+1}}{n}\right)$. □

In particular, as shown in [U], for $k = \log n$, $AT^2 = \Omega(n^2\log n)$.
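The perimeter bound is the product of the two constraints described before the theorem. As a worked check (a sketch that takes $A = LW$ for the rectangular layout and writes $I$ for the information that must cross the cut):

```latex
\begin{align*}
AT^2 &= (LT)(WT) = \Omega(nk)\cdot\Omega(I)
      = \Omega\!\left(kn^2\log\frac{2^{k+1}}{n}\right),\\
k=\log n:\quad
AT^2 &= \Omega\!\left(n^2\log n\cdot\log 2\right) = \Omega(n^2\log n).
\end{align*}
```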

Additional results on $AT^2$ lower bounds for sorting can be found in [S2], and optimal
sorters for various length numbers are constructed in [CS].

4.  Upper  Bounds 

Circuits for sorting n k-bit numbers were presented in [S1]. Theorems 7 and 8 are simple
applications of the sorting schemes given there. The only difference is that since the
inputs can be read r times, it suffices to use the sorting schemes to output, for input phase j,
$y_{1+jn/r}, y_{2+jn/r}, \ldots, y_{n/r+jn/r}$, for $j = 0,1,\ldots,r-1$. Such a circuit requires sufficient size to
contain pass j's intended output of n/r sorted k-bit numbers, the (largest) X-index of the
maximum value so stored, a counter possibly going up to n, a small amount of controller logic,
and, say, two temporary registers of k and log n bits. As a consequence of the data storage
schemes in [S1], it follows that
Theorem 7.

For input replicated r times, a set of n k-bit numbers, where $2^k \le \frac{n}{r}$, can be sorted in
area $A = O\left(2^k\log\frac{2n}{r2^k} + \log n\right)$.

This result is clearly optimal for all values of r.

Theorem 8.

For input replicated $r \le n$ times, a set of n k-bit numbers, $\frac{n}{r} \le 2^k \le n^2$, can be sorted in
area $A = O\left(\frac{n}{r}\log\frac{r2^{k+1}}{n} + \log n\right)$.

This result is also optimal.
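The multi-pass output schedule behind these two theorems can be illustrated in software. The sketch below is not the circuit of [S1]; it is a minimal Python model, with the hypothetical name `multipass_sort`, of the storage discipline only: each pass over the input retains about n/r records, and the pair (value, index) of the largest record output so far tells the next pass where to resume. The index serves as a tie-breaker so that equal values are handled correctly.

```python
def multipass_sort(xs, r):
    """Sort xs using r read-only passes, keeping about n/r records per pass."""
    n = len(xs)
    batch = -(-n // r)            # ceil(n / r): numbers output per pass
    out = []
    last = None                   # (value, index) of the largest record output so far
    for _ in range(r):
        buf = []                  # holds at most `batch` (value, index) records
        for idx, v in enumerate(xs):              # one pass over the input
            rec = (v, idx)
            if last is not None and rec <= last:
                continue          # already output in an earlier pass
            if len(buf) < batch:
                buf.append(rec)
            elif rec < max(buf):
                buf[buf.index(max(buf))] = rec    # evict the largest stored record
        buf.sort()
        out.extend(v for v, _ in buf)
        if buf:
            last = buf[-1]
    return out
```

With r = 1 the buffer holds the whole input; with r = n it holds a single record, and the model degenerates to repeated minimum finding.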

The  next  sorting  construction  is  optimal  for  some  values  of  r  and  k,  and  provably 
suboptimal  for  others. 

Theorem 9.

For input replicated $r \le n$ times, a set of n k-bit numbers, where $2^k > n^2$, can be sorted in
area

$A = O\left(\frac{n}{r}\log n\right) + O(G)$,

where $G = \min\left\{n, \frac{nk}{r}, \frac{n}{r^{1/2}}\log n\right\}$.

Discussion: For $r < \log n$, n is dominated by $\frac{n}{r}\log n$, and hence the area bound is tight for
$r = O(\log n)$. The first term in the expression for G comes from using n input pads, one for
each number, and marking which inputs have been already sorted. If other pad markings
indicate which inputs are candidate maximum values for the group of $\frac{n}{r}$ values
currently being sorted, and which are evidently smaller values in the group, then only O(n)
bits are needed in addition to the number necessary for a rank and forward scheme on $\frac{n}{r}$
numbers. The second term arises from the obvious approach of storing the next $\frac{n}{r}$ numbers
to sort. The third results from intermediate passes to locate the addresses (indices) of the
current $\frac{n}{r^{1/2}}$ numbers to sort. In this case, all pairs of numbers are compared in each of $r^{1/2}$
sorting phases, and the addresses of suitably ranked values are stored.
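To see which of the three strategies determines G for given parameters, the terms can simply be evaluated. The helper below is a hypothetical illustration (the name `theorem9_area_terms` and the labels are not from the paper). It ignores all constant factors hidden by the O-notation and uses base-2 logarithms, so only the comparison between terms is meaningful.

```python
import math

def theorem9_area_terms(n, k, r):
    """Return (leading term, name of the smallest G term, its value), constants ignored."""
    lead = (n / r) * math.log2(n)                 # the (n/r) log n term
    g_terms = {
        "pads": n,                                # one marked input pad per number
        "store_values": n * k / r,                # store the next n/r k-bit numbers
        "store_addresses": n / math.sqrt(r) * math.log2(n),  # addresses of n/sqrt(r) numbers
    }
    g_name = min(g_terms, key=g_terms.get)
    return lead, g_name, g_terms[g_name]
```

For example, with n = 1024 and k = 64, the pad marking term is smallest when r = 4, while storing values is smallest when r = 1024.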


A natural question arises as to why the gap occurs for sorting with repeated inputs
and large k. Part of the discrepancy is caused by the format of the output schedule, and the
failure of our techniques to capture and to quantify adequately two constraints inherent to the
format. If output were not required to be when-determinate, but instead were required to
comprise ordered pairs (i, j), where i is the rank of input value $x_j$, then the solution would be
meaningful, and the circuit would certainly be a sorter. It is not difficult to construct a circuit
which sorts in this manner, reads the input r times, and which uses an area proportional to
$\frac{n}{r}\log n$. Furthermore, if such sorted data were then used as input, another kind of sorter
could be designed that reads its data r times, produces the standard where- and when-determinate
output, and that has area $A = \frac{n}{r}\log n$. Both tasks are easy. The obvious composition,
however, takes $r^2$ input passes for area $\frac{n}{r}\log n$. The result is an alternative construction
giving the $\frac{n}{r^{1/2}}\log n$ term for G in Theorem 9.

Also, the information theoretic nature of the proofs ensures that for replicated inputs,
the lower bounds arguments would hold if an oracle were consulted r times. A good oracle,
for example, might return the addresses of the next $\frac{n}{r}$ numbers to output, whence the
requisite area would be $O\left(\frac{n}{r}\log n\right)$. This bound accounts for an $O\left(\frac{n}{r}\log\frac{n}{r}\right)$ rank and forwarding
cost plus an $O\left(\frac{n}{r}\log r\right)$ area to store the addresses. Evidently, a more delicate decomposition
of time and information flow is needed to establish better lower bounds in this case.
We now describe a sorting scheme that reduces the bounds gap for some cases.

4.1. Partitioning techniques for multilective sorting

In [S1], a rank and forward sorting scheme is shown to work well when the number of
bits is large, say, for $k > 2\log n$. An essential aspect of the scheme, however, is to be able to
input all the numbers in bitwise parallel. The purpose of this section is to address the problem
of rank and forward schemes when the area is less than n. As in the previous section, the
sorting will proceed on consecutively ranking values. The problem is how to determine which
inputs to sort in a given pass, so that rank and forward sorting can be applied to the selected
variables.

Our  rank  and  forward  sorting  for  large  r<n  is  based  on  partitioning.  We  suppose  the 
scheme  sorts  x  numbers  per  phase.  Evidently,  it  suffices  to  know  the  addresses  of  the  x 
numbers for the current phase. The problem is how to identify them, and how to remember
them. We describe methods which store the addresses of the numbers to be sorted in the
current  phase.  In  view  of  Theorem  9,  these  addresses  can  be  stored  without  appreciably 
increasing  the  area  required  to  sort;  the  real  difficulty  is  how  to  identify  the  numbers.  One 
way  to  identify  them  is  by  ranking  all  n  numbers  during  each  phase.  The  addresses  are  likely 
to  require  storage  even  for  this  naive  sorting  scheme  if  the  output  is  to  be  when-determinate, 
since  specific  rankings  might  be  discovered  at  an  indeterminate  time,  when  not  all  the  bits  are 
available  or  are  scheduled  to  be  output.  This  method  can  be  implemented  by  comparing  all 
pairs  of  numbers.  One  possible  implementation  is  outlined  in  the  remarks  following 
Theorem  9. 

We  now  describe  sorting  algorithms  which  are  better,  at  least  when  k  is  not  too  large. 
They will, however, require $\Omega(k)$ storage. Since the methods are somewhat complicated,
inferior  algorithms  will  be  presented  first,  and  modifications  will  be  subsequently  outlined 
and  analyzed. 

In  the  following  discussion,  it  is  convenient  to  assume  that  all  input  values  are  distinct. 
This assumption causes no difficulty, since the sorting concerns the case when $k > 2\log n$, and
we  can  then  use  lexicographical  sorting  with  the  index  (address)  of  each  input  variable  as  the 
secondary  key. 
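The tie-breaking device just described amounts to sorting pairs instead of values. A minimal sketch in Python, where the helper name `with_distinct_keys` is hypothetical:

```python
def with_distinct_keys(values):
    """Pair each value with its input index so that all sort keys are distinct."""
    return [(v, i) for i, v in enumerate(values)]

# Tuples compare lexicographically: first by value, then by index.
keys = with_distinct_keys([7, 3, 7, 3])
ordered = sorted(keys)
```

Here `ordered` lists both copies of 3 before both copies of 7, and equal values appear in input order.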

Evidently the relevant addresses can be computed in one pass if the largest and
smallest values in the current sorting phase are known.

A reasonable scheme to sort x numbers per pass might be to initially store the addresses
of the h'th largest numbers, for $h = x, 2x, 4x, 8x, \ldots$. We call such a sequence an ordered
partitioning branch. These values could be found via techniques to compute order statistics.
Then only a small amount of computation might be necessary to update the sequence for each
new sorting phase.
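As an illustration of what such a branch contains, the following sketch computes it directly by sorting, which the circuit of course cannot afford to do. The function name is hypothetical. Ranks are counted here from the smallest value, an assumption chosen to match the later loop that sorts the x smallest remaining numbers in each phase.

```python
def ordered_partitioning_branch(values, x):
    """Return the addresses of the values of rank x, 2x, 4x, 8x, ... (from the smallest)."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    branch = []
    h = x
    while h <= len(values):
        branch.append(order[h - 1])   # address of the value of rank h
        h *= 2
    return branch
```

For eight inputs and x = 2 the branch holds three addresses, those of the values of rank 2, 4 and 8.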

It  turns  out  that  this  scheme  can  be  implemented.  Improvements  are  based  on  modifying 
the  standard  order  statistic  algorithms,  relaxing  the  requirement  that  they  find  exact  order 
statistics,  and  relaxing  the  requirement  that  the  rankings  of  the  elements  (represented)  in  the 
ordered  partitioning  branch  be  exact. 

Given a set S, define $S_{x,y}$ to be $\{s \in S \mid x < s \le y\}$. Given a finite set S with $|S| = c$, we call a
sequence $-\infty \le s_0 < s_1 < s_2 < \cdots < s_d < s_{d+1} = \infty$ an x-k-decomposition of S if

(1) $s_0 = -\infty$ if $k = 0$, and $s_0$ is the kx'th smallest member of S if $k > 0$.

(2) $\{s_1, \ldots, s_d\} \subset S$.

(3) $3x > |S_{s_0,s_1}| \ge x$.

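(4) For $i = 1, \ldots, d$, $c2^{-l} \ge |S_{s_{i-1},s_i}| > c2^{-l-1}$ implies $c2^{-l+1} \ge |S_{s_i,s_{i+1}}| > c2^{-l-1}$.

(5) For $i = 0, 1, \ldots, d-3$, $c2^{-l} \ge |S_{s_i,s_{i+1}}| > c2^{-l-1}$ implies $|S_{s_{i+3},s_{i+4}}| > c2^{-l}$.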

The addresses of the values $s_0, \ldots, s_d$ will form our (generalized) x-k-ordered partitioning
branch of S. The requirements 1 through 5 have simple interpretations. They are

(1) $s_0$ is the largest value sorted so far (or $-\infty$, if none have been sorted).

(2) The values are in S.

(3) The partition of S with the smallest values contains enough values for the current output
phase, and not too many more.

(4) If we call the half open interval $(c2^{-l-1}, c2^{-l}]$ slot l, then the sizes of two consecutive
partitions in the ordered partitioning of S belong to the same or consecutive slots.

(5) The sizes of at most three (consecutive) partitions of S can belong to a single slot.

We remark that it will soon be convenient to relax the definition of an x-k-ordered partitioning
to allow some empty slots. The reason for this next (and last) modification is to
define an invariant that allows an ordered partitioning to be updated in each sorting phase
with only one application of a subroutine call. The subroutine will, given addresses of partition
values $s_i$ and $s_{i+1}$, return an address of a value s, where $s_i < s < s_{i+1}$, and
$\frac{2}{3}|S_{s_i,s_{i+1}}| \ge |S_{s,s_{i+1}}| \ge \frac{1}{2}|S_{s_i,s_{i+1}}|$. Thus we need a separator algorithm which partitions a set into
two pieces, and the larger piece contains the larger values and has a size at most two thirds of
the original set.
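The contract of the separator subroutine can be made concrete with a reference version that simply sorts the partition. This is a specification sketch only, not the low-area algorithm developed later. The name `separator` and the use of values rather than addresses are simplifications, and distinct input values are assumed.

```python
def separator(values, lo, hi):
    """Return s with lo < s < hi such that the upper piece {v : s < v <= hi}
    holds between one half and two thirds of the partition {v : lo < v <= hi}."""
    part = sorted(v for v in values if lo < v <= hi)
    m = len(part)
    if m < 2:
        raise ValueError("need at least two values to split")
    s = part[m // 2 - 1]          # the lower piece {v : lo < v <= s} gets floor(m/2) values
    upper = m - m // 2            # size of the upper piece, ceil(m/2)
    assert 2 * upper >= m and 3 * upper <= 2 * m
    return s
```

Choosing the lower median works because ceil(m/2) never exceeds 2m/3 once m is at least 2.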

We postpone a discussion of the separator subroutine to concentrate on the other
aspects of the algorithm. The separator is used, of course, to split a subset $S_{a,b}$ into two
pieces, $S_{a,s}$ and $S_{s,b}$. It is simple to verify that if $S_{a,b}$ occupied slot $l-1$, then there are only
three possibilities for the new pieces. Either $S_{s,b}$ occupies slot $l-1$ and $S_{a,s}$ occupies slot l, or both
occupy slot l, or $S_{s,b}$ occupies slot l, and $S_{a,s}$ occupies slot $l+1$. The point of this organization
will be that in any case, the formerly empty slot l will get a subset.
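The three-way case analysis can be checked by brute force. The sketch below uses hypothetical helper names and assumes c is a power of two so that slot boundaries are integers. It enumerates every set size in slot l-1 and every split in which the upper piece holds between one half and two thirds of the set, and records the slots in which the two pieces land.

```python
def slot(size, c):
    """Return l with c*2**(-l-1) < size <= c*2**(-l), for 1 <= size <= c."""
    l = 0
    while size * 2 ** (l + 1) <= c:
        l += 1
    return l

def split_outcomes(c, l):
    """Collect (slot of upper piece, slot of lower piece) over all admissible
    splits of sets whose size lies in slot l-1."""
    outcomes = set()
    for m in range(c // 2 ** l + 1, c // 2 ** (l - 1) + 1):   # sizes in slot l-1
        for upper in range(1, m):
            if 2 * upper >= m and 3 * upper <= 2 * m:          # m/2 <= upper <= 2m/3
                outcomes.add((slot(upper, c), slot(m - upper, c)))
    return outcomes
```

For c = 1024 and l = 3 the only outcomes are (2, 3), (3, 3) and (3, 4), so slot 3 receives a piece in every case.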

The sorting algorithm is:

Create x-0-ordered partition branch;
While not done do
    Sort x smallest numbers remaining in a partition;
    Call separation algorithm and fill lowest empty slot (if any) by
        splitting the lowest partition in the next larger slot
End.

We identify as the bottom slot a pair of slots, namely the slot containing $S_{s_0,s_1}$ and the
next larger slot. Higher slots remain uniquely named. We are willing to sort all the members
of a single partition (having the smallest values) in the bottom slot, if the partition contains x
or more members. Otherwise we merge the set with the next piece and sort that. Correctness
follows from the following invariant. Between closest (internal) empty slots is a slot containing
two or more subsets. That is, if slots j and l > j are empty (and above the bottom slot),
then either there are two subsets with sizes in slot h, for some h: j < h < l, or all members of
the partition have sizes belonging to slots below j. A simple induction argument will show that
this property is preserved. It is clearly sufficient to guarantee that two (internal) empty slots
cannot be consecutive, whence the lowest empty slot can always be filled. In particular, the
bottom slot will never be empty (until all the numbers are sorted).

Creating the initial ordered branch is also simple. The details are omitted. We now consider
separator algorithms.

Suppose we read l numbers at a time, and use sorting circuitry to find the median of the
input. The standard algorithm [AHU] for l = 5 reads,

OrderS(S, m)

    If |S| ≤ l Return (m'th smallest value in S);
    Partition S into |S|/l pieces of size l;
    S' ← ∪ {median of each piece};
    μ ← OrderS(S', ⌈|S'|/2⌉);
    S_l ← {s ∈ S | s < μ};
    S_e ← {s ∈ S | s = μ};
    S_g ← {s ∈ S | s > μ};
    m_l ← |S_l|;
    m_e ← |S_e|;
    If m_l ≥ m Return(OrderS(S_l, m))
    Else if m_e ≥ m − m_l Return(OrderS(S_e, m − m_l))
    Else Return(OrderS(S_g, m − m_l − m_e));
End.
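For reference, the selection procedure can be written as runnable Python. This is a sketch of the textbook median-of-medians method on in-memory lists, not of a circuit. The name `order_s` and the convention that m counts from the smallest value are assumptions of this sketch, and the group size l must be at least 2.

```python
def order_s(values, m, l=5):
    """Return the m'th smallest element of values (m is 1-indexed)."""
    if len(values) <= l:
        return sorted(values)[m - 1]
    # Medians of consecutive pieces of size l (the last piece may be shorter).
    pieces = [values[i:i + l] for i in range(0, len(values), l)]
    medians = [sorted(p)[(len(p) - 1) // 2] for p in pieces]
    mu = order_s(medians, (len(medians) + 1) // 2, l)
    smaller = [s for s in values if s < mu]
    equal = [s for s in values if s == mu]
    larger = [s for s in values if s > mu]
    if len(smaller) >= m:
        return order_s(smaller, m, l)
    if len(smaller) + len(equal) >= m:
        return mu                     # the answer is the pivot itself
    return order_s(larger, m - len(smaller) - len(equal), l)
```

Each recursive call works on a strictly smaller list, because the pivot belongs to the input and is excluded from both `smaller` and `larger`.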

This method will be modified as follows:

(1) Approximate order statistics are found, rather than exact ones.

(2) A depth first iteration is used, instead of breadth first recursion.

(3) Rather than return a median from one level to use as a separator at the next, separators
from the deepest level are returned to the top level.

The consequences are

(1) A savings of input passes.

(2) A savings of input passes and area.

(3) A savings of input passes at a cost of finding a very weak (preliminary) separator.

Then l can be tuned for the best performance.

First, though, it should be noted that the standard algorithm as given is materially simplified
when l is chosen optimally. The reason is that if the algorithm finds medians of l
numbers at a time, then $\frac{n}{l}$ median addresses must be stored, whence a good separator can be
found in 2 passes, and the area for this portion of the algorithm is $l + \frac{n}{l}\log l$. If x numbers
are sorted per phase, then the circuit can be implemented in area $A = O\left(k + x\log n + l + \frac{n}{l}\log l\right)$
and input passes $r = O\left(\frac{n}{x}\right)$. We set $l = (n\log n)^{1/2}$, whence $A = O\left(k + (n\log n)^{1/2} + \frac{n}{r}\log n\right)$.
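As a worked check of this choice (a sketch only, with base-2 logarithms and constant factors ignored), take $l = (n\log n)^{1/2}$ and use $r = O(n/x)$, that is, $x = O(n/r)$:

```latex
\begin{align*}
\frac{n}{l}\log l
  &= \sqrt{\frac{n}{\log n}}\cdot\tfrac12\left(\log n + \log\log n\right)
   \le \sqrt{\frac{n}{\log n}}\cdot\log n = (n\log n)^{1/2},\\
l + \frac{n}{l}\log l &= O\!\left((n\log n)^{1/2}\right),\\
x\log n &= O\!\left(\frac{n}{r}\log n\right),\\
A &= O\!\left(k + (n\log n)^{1/2} + \frac{n}{r}\log n\right).
\end{align*}
```

The two separator terms $l$ and $\frac{n}{l}\log l$ are balanced, up to constant factors, by this value of $l$.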

One difficulty with the algorithm as outlined is that it is not where-determinate. The
difficulty occurs at the outer level. When only a subset of the n numbers is to be processed
(due to partitioning), the $\frac{n}{l}$ medians found in the first pass will not be medians of uniformly
sized subsets if the outer level is to be completed in one pass. Thus either additional passes
are needed, or weighted median techniques must be used.

It turns out that weighted medians can indeed be sensibly used, and the area bound is
affected by no more than a constant factor. The details are omitted.

In the modified depth first method, $\log_l n$ groups of l addresses are stored, one group
for each level of depth in the algorithm. The following skeletal description shows the organization,
but omits critical details essential to correctness and efficient operation.

Algorithm Weaksep

Procedure Load(k,d)
    If d=0, k ← address of median of l current values;
        Load(k,1)
    Else insert k into addressbucket(d);
        If addressbucket(d) is (possibly) full then
            k ← address of median of values represented in addressbucket(d);
            Load(k,d+1);
    Return;
End Load;

Repeat
    While a group of l inputs are left unread Do
        Load(k,0);
    Flush lower levels (or assume $n = l^i$).
    Return final address.
    Partition based on value at address;
With bigger subpiece until suitable separator found;
End Weaksep.

It should be noted that due to when-determinism of the input, separator computations at
level d require $l^d$ values to be read, so that all values corresponding to the stored addresses
can be compared. At the top level, the separator is only computed for a subset of the inputs,
which lie between two partitioning values. Consequently, a when-determinate operation of
this scheme that uses input passes efficiently will empty buckets which are not full. As a
consequence, the median computations are weighted. The partitioning values that are found
are very weak separators. Indeed, after $\log_l n$ levels of recursion, a weak separator is found
which partitions the original set into two pieces, neither smaller than $n^{1-\log_l 2}$. It is easy to see
that if the set containing the median is partitioned further, and the process repeated, then a
strong separator, which divides the set into two pieces, neither smaller than one fourth of the
original set, is found in $O(n^{\log_l 2})$ iterations. Moreover, a 2/3 -- 1/2 separator can be constructed
by applying a small number of calls to a separator algorithm that partitions sets into
pieces of size 1/4 or larger.

If values (rather than addresses) were stored, Weaksep could be implemented with the
input read only once. When addresses are stored, a weak separator can be found in $\log_l n$
passes over the input, which corresponds to a (distributed) input pass for each bucket level.
Hence $O(n^{\log_l 2}\log_l n)$ input passes are needed per partitioning phase.
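The one-pass, value-storing variant mentioned above is easy to model in software. The sketch below is an illustration only: the name `weak_separator` is hypothetical, medians are unweighted, and the flush step is simplified to taking the median of the highest non-empty bucket. It cascades medians through one bucket of at most l values per level.

```python
def weak_separator(values, l):
    """One-pass bucket cascade: bucket d collects medians produced at level d-1."""
    buckets = []                       # buckets[d] holds fewer than l pending values

    def median(group):
        return sorted(group)[(len(group) - 1) // 2]

    def load(v, d):
        if d == len(buckets):
            buckets.append([])
        buckets[d].append(v)
        if len(buckets[d]) == l:       # bucket full: pass its median one level up
            group, buckets[d] = buckets[d], []
            load(median(group), d + 1)

    for v in values:                   # the single pass over the input
        load(v, 0)
    # Simplified flush: if len(values) == l**i, only the top bucket is non-empty.
    top = max(d for d, b in enumerate(buckets) if b)
    return median(buckets[top])
```

With l = 3 and 27 distinct inputs, three levels of medians guarantee that at least 8 inputs lie at or below the result and at least 8 lie at or above it. That is a weak guarantee compared with a true median, which is the sense in which the separator is weak.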

Recalling that l numbers valued 0 to n-1 can be stored in $l\log\frac{n}{l}$ area, and that the
depth of the separator finding scheme is $\log_l n$, we conclude that the area for the separation
operation is $l\log\frac{n}{l}\log_l n$. If x numbers are to be sorted in each sorting phase, then their
addresses can be found in one input pass if surrounding values are known. The area to store
the addresses is $O\left(x\log\frac{n}{x}\right)$. The sorting can be done in one more input pass. Our ordered
partition stores $\log\frac{n}{x}$ partitioning addresses, and is updated only once per sorting phase.
Thus the total number of input phases (including initialization passes) is
$r = O\left(n^{\log_l 2}\log_l n\log\frac{n}{x} + \frac{n}{x}n^{\log_l 2}\log_l n\right)$ and the area is $A = O(k + x\log n + l\log n\log_l n)$. Set $l = 2^{1/\delta}$,
where $\delta$ is fixed. Then $A = O\left(k + \frac{n^{1+\delta}}{r}\log^2 n\right)$.

If $r \le n^{\alpha}$ where $\alpha < 1$, setting $l = \frac{n}{r}$ gives $A = O\left(k + \frac{2^{\frac{1}{1-\alpha}}n}{r}\log n\right)$. The corresponding
expression is a bit tamer for $l = 2n/r$.

Theorem 10.

For input replicated $r \le n^{\alpha}$ times where $\alpha < 1$, a set of n k-bit numbers can be sorted in
area $A = O\left(\frac{n}{r}\log n + k\right)$. □

Evidently, if $k = O\left(\frac{n}{r}\log n\right)$, then such a sorting circuit can be implemented with an area that
is within a constant factor of optimal.

5.  Conclusions 

We have shown that for inputs read $r \le n$ times, n integers in the range $[0, m-1]$ can be
sorted in area

$$A(n,m,r) = \Theta\begin{cases}
\log n + m\log\frac{2n}{rm}, & m \le n/r,\\
\log n + \frac{n}{r}\log\frac{2rm}{n}, & n/r \le m \le n^{O(1)}.
\end{cases}$$


These  bounds  also  apply  to  a  bit  modeled  sorting  network,  since  the  sorters  are  comprised  of 
little  more  than  circular  shift  registers,  counters  and  comparators.  These  devices  can  be  built 
with  bounded  fan  in  and  fan  out,  and  can  operate  in  a  systolic  or  (partially)  self  timed 
manner,  without  the  benefit  of  global  clock  or  control  lines. 

For a VLSI circuit model, the techniques show that if the inputs are read once, then a
when- and where-determinate circuit that sorts n integers in the range $[0, m-1]$, for
$n \le m \le n^{O(1)}$, must satisfy $AT^2 = \Omega\left(n^2\log^2\frac{2m}{n}\right)$. Moreover, this bound is the best possible for
the proof technique used.

It is worth observing that the proof of the $AT^2$ bound is based on the inevitable misallocation
of the more significant input bits. The distribution of less significant input bits is
irrelevant. In this sense, the standard proof for $k = 1 + \log n$ is somewhat misleading, since it
draws attention to the locations of the least significant inputs.

This misallocation of more significant bits shows a little more about $AT^2$ bounds. In
many applications, records are sorted according to a key field. We have shown that if the
entire record is to be sorted according to a read-once, when- and where-determinate format,
then the information content of the data field affects the $AT^2$ bound. For example, suppose
that n records are to be sorted, that each record has a $\log n$-bit sort key, and that each record
has a data field comprising $\Theta(\log n)$ bits of arbitrary information. Then $AT^2 = \Omega(n^2\log^2 n)$.
Circuits cannot beat this bound unless the data field is short, or the data is not arbitrary.
Analogous remarks, evidently, hold for sorting circuits of minimum area.